{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l-23gBrt4x2B"
      },
      "source": [
        "##### Copyright 2021 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "HMUDt0CiUJk9"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "77z2OchJTk0l"
      },
      "source": [
        "# Migrating feature_columns to TF2\n",
        "\n",
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://www.tensorflow.org/guide/migrate/migrating_feature_columns\">\n",
        "    <img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />\n",
        "    View on TensorFlow.org</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/migrate/migrating_feature_columns.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />\n",
        "    Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/en/guide/migrate/migrating_feature_columns.ipynb\">\n",
        "    <img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />\n",
        "    View source on GitHub</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/en/guide/migrate/migrating_feature_columns.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "g3Zltl6_3Gy2"
      },
      "outputs": [],
      "source": [
        "# Temporarily install tf-nightly as the notebook depends on symbols in 2.6.\n",
        "!pip uninstall -q -y tensorflow keras\n",
        "!pip install -q tf-nightly"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-5jGPDA2PDPI"
      },
      "source": [
        "Training a model will usually come with some amount of feature preprocessing, particularly when dealing with structured data. When training an `tf.estimator.Estimator` in TF1, this feature preprocessing is done with the `tf.feature_column` API. In TF2, this preprocessing can be done directly with Keras layers, called _preprocessing layers_.\n",
        "\n",
        "In this migration guide, you will perform some common feature transformations using both feature columns and preprocessing layers, followed by training a complete model with both APIs.\n",
        "\n",
        "First, start with a couple of necessary imports,"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "iE0vSfMXumKI"
      },
      "outputs": [],
      "source": [
        "import tensorflow as tf\n",
        "import tensorflow.compat.v1 as tf1\n",
        "import math"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NVPYTQAWtDwH"
      },
      "source": [
        "and add a utility for calling a feature column for demonstration:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LAaifuuytJjM"
      },
      "outputs": [],
      "source": [
        "def call_feature_columns(feature_columns, inputs):\n",
        "  # This is a convenient way to call a `feature_column` outside of an estimator\n",
        "  # to display its output.\n",
        "  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)\n",
        "  return feature_layer(inputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GXkmiuwXTS-B"
      },
      "source": [
        "## One-hot encoding integer IDs\n",
        "\n",
        "A common feature transformation is one-hot encoding integer inputs of a known range. Here is an example using feature columns:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XasXzOgatgRF"
      },
      "outputs": [],
      "source": [
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'type', num_buckets=3)\n",
        "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n",
        "call_feature_columns(indicator_col, {'type': [0, 1, 2]})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iSCkJEQ6U-ru"
      },
      "source": [
        "Using Keras preprocessing layers, these columns can be replaced by a single `tf.keras.layers.CategoryEncoding` layer with `output_mode` set to `'one_hot'`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "799lbMNNuAVz"
      },
      "outputs": [],
      "source": [
        "one_hot_layer = tf.keras.layers.CategoryEncoding(\n",
        "    num_tokens=3, output_mode='one_hot')\n",
        "one_hot_layer([0, 1, 2])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5bm9tJZAgpt4"
      },
      "source": [
        "## One-hot encoding string data with a vocabulary\n",
        "\n",
        "Handling string features often requires a vocabulary lookup to translate strings into indices. Here is an example using feature columns to lookup strings and then one-hot encode the indices:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3fG_igjhukCO"
      },
      "outputs": [],
      "source": [
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'sizes',\n",
        "    vocabulary_list=['small', 'medium', 'large'],\n",
        "    num_oov_buckets=0)\n",
        "indicator_col = tf1.feature_column.indicator_column(vocab_col)\n",
        "call_feature_columns(indicator_col, {'sizes': ['small', 'medium', 'large']})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8rBgllRtY738"
      },
      "source": [
        "Using Keras preprocessing layers, use the `tf.keras.layers.StringLookup` layer with `output_mode` set to `'one_hot'`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "arnPlSrWvDMe"
      },
      "outputs": [],
      "source": [
        "string_lookup_layer = tf.keras.layers.StringLookup(\n",
        "    vocabulary=['small', 'medium', 'large'],\n",
        "    num_oov_indices=0,\n",
        "    output_mode='one_hot')\n",
        "string_lookup_layer(['small', 'medium', 'large'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c1CmfSXQZHE5"
      },
      "source": [
        "## Embedding string data with a vocabulary\n",
        "\n",
        "For larger vocabularies, an embedding is often needed for good performance. Here is an example embedding a string feature using feature columns:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C3RK4HFazxlU"
      },
      "outputs": [],
      "source": [
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'col',\n",
        "    vocabulary_list=['small', 'medium', 'large'],\n",
        "    num_oov_buckets=0)\n",
        "embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)\n",
        "call_feature_columns(embedding_col, {'col': ['small', 'medium', 'large']})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3aTRVJ6qZZH0"
      },
      "source": [
        "Using Keras preprocessing layers, this can be achieved by combining a `tf.keras.layers.StringLookup` layer and an `tf.keras.layers.Embedding` layer. The default output for the `StringLookup` will be integer indices which can be fed directly into an embedding.\n",
        "\n",
        "Note that the `Embedding` layer contains trainable parameters. While the `StringLookup` layer can be applied to data inside or outside of a model, the `Embedding` must always be part of a trainable Keras model to function correctly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8resGZPo0Fho"
      },
      "outputs": [],
      "source": [
        "string_lookup_layer = tf.keras.layers.StringLookup(\n",
        "    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)\n",
        "embedding = tf.keras.layers.Embedding(3, 4)\n",
        "embedding(string_lookup_layer(['small', 'medium', 'large']))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3I5loEx80MVm"
      },
      "source": [
        "## Complete training example\n",
        "\n",
        "To show a complete training workflow, first prepare some data with three features of different types:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "D_7nyBee0ZBV"
      },
      "outputs": [],
      "source": [
        "features = {\n",
        "    'type': [0, 1, 1],\n",
        "    'size': ['small', 'small', 'medium'],\n",
        "    'weight': [2.7, 1.8, 1.6],\n",
        "}\n",
        "labels = [1, 1, 0]\n",
        "predict_features = {'type': [0], 'size': ['foo'], 'weight': [-0.7]}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e_4Xx2c37lqD"
      },
      "source": [
        "Define some common constants for both TF1 and TF2 workflows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3cyfQZ7z8jZh"
      },
      "outputs": [],
      "source": [
        "vocab = ['small', 'medium', 'large']\n",
        "one_hot_dim = 3\n",
        "embedding_dim = 4\n",
        "weight_mean = 2.0\n",
        "weight_variance = 1.0"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ywCgU7CMIfTH"
      },
      "source": [
        "### With feature columns\n",
        "\n",
        "Feature columns must be passed as a list to the estimator on creation, and will be called implicitly during training."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Wsdhlm-uipr1"
      },
      "outputs": [],
      "source": [
        "categorical_col = tf1.feature_column.categorical_column_with_identity(\n",
        "    'type', num_buckets=one_hot_dim)\n",
        "# Convert index to one-hot; e.g. [2] -> [0,0,1].\n",
        "indicator_col = tf1.feature_column.indicator_column(categorical_col)\n",
        "\n",
        "# Convert strings to indices; e.g. ['small'] -> [1].\n",
        "vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(\n",
        "    'size', vocabulary_list=vocab, num_oov_buckets=1)\n",
        "# Embed the indices.\n",
        "embedding_col = tf1.feature_column.embedding_column(vocab_col, embedding_dim)\n",
        "\n",
        "normalizer_fn = lambda x: (x - weight_mean) / math.sqrt(weight_variance)\n",
        "# Normalize the numeric inputs; e.g. [2.0] -> [0.0].\n",
        "numeric_col = tf1.feature_column.numeric_column(\n",
        "    'weight', normalizer_fn=normalizer_fn)\n",
        "\n",
        "estimator = tf1.estimator.DNNClassifier(\n",
        "    feature_columns=[indicator_col, embedding_col, numeric_col],\n",
        "    hidden_units=[1])\n",
        "\n",
        "def _input_fn():\n",
        "  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n",
        "\n",
        "estimator.train(_input_fn)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qPIeG_YtfNV1"
      },
      "source": [
        "The feature columns will also be used to transform input data when running inference on the model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "K-AIIB8CfSqt"
      },
      "outputs": [],
      "source": [
        "def _predict_fn():\n",
        "  return tf1.data.Dataset.from_tensor_slices(predict_features).batch(1)\n",
        "\n",
        "next(estimator.predict(_predict_fn))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "baMA01cBIivo"
      },
      "source": [
        "### With Keras preprocessing layers\n",
        "\n",
        "Keras preprocessing layers are more flexible in where they can be called. A layer can be applied directly to tensors, used inside a `tf.data` input pipeline, or built directly into a trainable Keras model.\n",
        "\n",
        "In this example, we will apply preprocessing layers inside a `tf.data` input pipeline. To do this, you can define a separate `tf.keras.Model` to preprocess your input features. This model is not trainable, but is a convenient way to group preprocessing layers."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NMz8RfMQdCZf"
      },
      "outputs": [],
      "source": [
        "inputs = {\n",
        "  'type': tf.keras.Input(shape=(), dtype='int64'),\n",
        "  'size': tf.keras.Input(shape=(), dtype='string'),\n",
        "  'weight': tf.keras.Input(shape=(), dtype='float32'),\n",
        "}\n",
        "outputs = {\n",
        "  # Convert index to one-hot; e.g. [2] -> [0,0,1].\n",
        "  'type': tf.keras.layers.CategoryEncoding(\n",
        "      one_hot_dim, output_mode='one_hot')(inputs['type']),\n",
        "  # Convert size strings to indices; e.g. ['small'] -> [1].\n",
        "  'size': tf.keras.layers.StringLookup(vocabulary=vocab)(inputs['size']),\n",
        "  # Normalize the numeric inputs; e.g. [2.0] -> [0.0].\n",
        "  'weight': tf.keras.layers.Normalization(\n",
        "      axis=None, mean=weight_mean, variance=weight_variance)(inputs['weight']),\n",
        "}\n",
        "preprocessing_model = tf.keras.Model(inputs, outputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NRfISnj3NGlW"
      },
      "source": [
        "You can now apply this model inside a call to `tf.data.Dataset.map`. Please note that the function passed to `map` will automatically be converted into\n",
        "a `tf.function`, and usual caveats for writing `tf.function` code apply (no side effects)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c_6xAUnbNREh"
      },
      "outputs": [],
      "source": [
        "# Apply the preprocessing in tf.data.Dataset.map.\n",
        "dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n",
        "dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),\n",
        "                      num_parallel_calls=tf.data.AUTOTUNE)\n",
        "# Display a preprocessed input sample.\n",
        "next(dataset.take(1).as_numpy_iterator())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8_4u3J4NdJ8R"
      },
      "source": [
        "Next, you can define a separate `Model` containing the trainable layers. Note how the inputs to this model now reflect the preprocessed feature types and shapes."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kC9OZO5ldmP-"
      },
      "outputs": [],
      "source": [
        "inputs = {\n",
        "  'type': tf.keras.Input(shape=(one_hot_dim,), dtype='float32'),\n",
        "  'size': tf.keras.Input(shape=(), dtype='int64'),\n",
        "  'weight': tf.keras.Input(shape=(), dtype='float32'),\n",
        "}\n",
        "# Since the embedding is trainable, it needs to be part of the training model.\n",
        "embedding = tf.keras.layers.Embedding(len(vocab), embedding_dim)\n",
        "outputs = tf.keras.layers.Concatenate()([\n",
        "  inputs['type'],\n",
        "  embedding(inputs['size']),\n",
        "  tf.expand_dims(inputs['weight'], -1),\n",
        "])\n",
        "outputs = tf.keras.layers.Dense(1)(outputs)\n",
        "training_model = tf.keras.Model(inputs, outputs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ir-cn2H_d5R7"
      },
      "source": [
        "You can now train the `training_model` with `tf.keras.Model.fit`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6TS3YJ2vnvlW"
      },
      "outputs": [],
      "source": [
        "# Train on the preprocessed data.\n",
        "training_model.compile(\n",
        "    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))\n",
        "training_model.fit(dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pSaEbOE4ecsy"
      },
      "source": [
        "Finally, at inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QHjbIZYneboO"
      },
      "outputs": [],
      "source": [
        "inputs = preprocessing_model.input\n",
        "outpus = training_model(preprocessing_model(inputs))\n",
        "inference_model = tf.keras.Model(inputs, outpus)\n",
        "\n",
        "predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)\n",
        "inference_model.predict(predict_dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IXMBwzggwUjI"
      },
      "source": [
        "Note: Preprocessing layers are not trainable, which allows you to apply them *asynchronously* using with `tf.data`. This has performence benefits, as you can both [prefetch](https://www.tensorflow.org/guide/data_performance#prefetching) batches with preprocessing applied, and free up any accelerators to focus on the differentiable parts of a model. However, when training performance is not important, it is sometimes simpler to add preprocessing layers directly into a complete model. This is often the case during inference, and may also be true for smaller models or when training entirely on a CPU."
      ]
    },
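    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aSyncPrefet1"
      },
      "source": [
        "As a minimal sketch of this asynchronous approach (reusing the `features`, `labels`, `preprocessing_model`, and `training_model` defined above), you can add `tf.data.Dataset.prefetch` after the preprocessing `map`, so that the next batch is transformed while the current batch trains:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aSyncPrefet2"
      },
      "outputs": [],
      "source": [
        "# A sketch of asynchronous preprocessing: map the non-trainable\n",
        "# preprocessing model over the dataset, then prefetch so the next batch\n",
        "# is preprocessed while the current batch is training.\n",
        "dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)\n",
        "dataset = dataset.map(lambda x, y: (preprocessing_model(x), y),\n",
        "                      num_parallel_calls=tf.data.AUTOTUNE)\n",
        "dataset = dataset.prefetch(tf.data.AUTOTUNE)\n",
        "training_model.fit(dataset)"
      ]
    },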
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2pjp7Z18gRCQ"
      },
      "source": [
        "## Feature column equivalence table\n",
        "\n",
        "For reference, here is an approximate correspondence between feature columns and\n",
        "preprocessing layers:<table>\n",
        "  <tr>\n",
        "    <th>Feature Column</th>\n",
        "    <th>Keras Layer</th>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.bucketized_column`</td>\n",
        "    <td>`layers.Discretization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.categorical_column_with_hash_bucket`</td>\n",
        "    <td>`layers.Hashing`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.categorical_column_with_identity`</td>\n",
        "    <td>`layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.categorical_column_with_vocabulary_file`</td>\n",
        "    <td>`layers.StringLookup` or `layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.categorical_column_with_vocabulary_list`</td>\n",
        "    <td>`layers.StringLookup` or `layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.crossed_column`</td>\n",
        "    <td>Not implemented.</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.embedding_column`</td>\n",
        "    <td>`layers.Embedding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.indicator_column`</td>\n",
        "    <td>`output_mode='one_hot'` or `output_mode='multi_hot'`*</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.numeric_column`</td>\n",
        "    <td>`layers.Normalization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.sequence_categorical_column_with_hash_bucket`</td>\n",
        "    <td>`layers.Hashing`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.sequence_categorical_column_with_identity`</td>\n",
        "    <td>`layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.sequence_categorical_column_with_vocabulary_file`</td>\n",
        "    <td>`layers.StringLookup` or `layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.sequence_categorical_column_with_vocabulary_list`</td>\n",
        "    <td>`layers.StringLookup` or `layers.IntegerLookup`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.sequence_numeric_column`</td>\n",
        "    <td>`layers.Normalization`</td>\n",
        "  </tr>\n",
        "  <tr>\n",
        "    <td>`feature_column.weighted_categorical_column`</td>\n",
        "    <td>`layers.CategoryEncoding`</td>\n",
        "  </tr>\n",
        "</table>\n",
        "\n",
        "\\* `output_mode` can be passed to `layers.CategoryEncoding`, `layers.StringLookup`, and `layers.IntegerLookup`.\n"
      ]
    },
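    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eqTabDiscEx1"
      },
      "source": [
        "For example, here is a minimal sketch of the first row of this table, comparing `tf1.feature_column.bucketized_column` with `tf.keras.layers.Discretization` (the `'length'` feature and its bucket boundaries are hypothetical, chosen only for illustration):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eqTabDiscEx2"
      },
      "outputs": [],
      "source": [
        "# Feature column: bucketize a numeric input at fixed boundaries.\n",
        "# `DenseFeatures` one-hot encodes the resulting bucket indices.\n",
        "bucketized_col = tf1.feature_column.bucketized_column(\n",
        "    tf1.feature_column.numeric_column('length'), boundaries=[1.0, 2.0])\n",
        "print(call_feature_columns(bucketized_col, {'length': [0.5, 1.5, 2.5]}))\n",
        "\n",
        "# Preprocessing layer: the same boundaries, with integer bucket indices\n",
        "# as output (follow with a `CategoryEncoding` layer for a one-hot output).\n",
        "discretization_layer = tf.keras.layers.Discretization(\n",
        "    bin_boundaries=[1.0, 2.0])\n",
        "print(discretization_layer([0.5, 1.5, 2.5]))"
      ]
    },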
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AQCJ6lM3YDq_"
      },
      "source": [
        "## Next Steps\n",
        "\n",
        " - For more information on keras preprocessing layers, see [the guide to preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers).\n",
        " - For a more in-depth example of applying preprocessing layers to structured data, see [the structured data tutorial](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers)."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "migrating_feature_columns.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
